Distributions

Raoul Grouls

Overview

  • Distributions
    • difference between discrete and continuous
    • mass vs density functions
  • Discrete
    • Bernoulli
    • Uniform
    • Poisson
  • Continuous
    • (Log)Normal
    • Exponential
    • Beta

Overview

  • Modelling
    • Motivation
    • Epistemic vs Aleatoric
    • Modelling under uncertainty

Distributions

Distributions

A probability distribution is:

a mathematical description

of a random phenomenon

in terms of all its possible outcomes

and their associated probabilities

Distributions

In this image, what do you expect to be:

  • all the possible outcomes?
  • the associated probabilities?

Distributions: types

The main types of distributions are:

  • Discrete : when an outcome can only take discrete values (e.g. number of birds)
  • Continuous : when outcomes take continuous values (e.g. blood pressure)

Distributions: basic visualisation

Every curve you draw along the horizontal axis can be interpreted as a continuous distribution (once it is non-negative and scaled to an area of 1). Every barplot can be read as a discrete distribution in the same way.

All the distributions we are going to discuss are variations of these two basic types!

For parametric distributions, we have a formula that describes the line / bars. You just put in the parameters, and the output is the line / bars.

Discrete distributions: PMF

A probability mass function (pmf) describes the probability distribution of discrete variables.

Consider a coin toss:

\[ f(x) = \begin{cases} 0.5 & x \text{ is heads} \\ 0.5 & x \text{ is tails} \end{cases} \]

This is the pmf of the Bernoulli distribution.

Conditions for a PMF

  1. An event cannot have a negative probability
  2. The sum of probabilities of all events must be 1
  3. The probability of a subset \(\mathscr{A}\) of the sample space \(\mathscr{S}\) is the sum of the probabilities of its individual elements.

Conditions for a pmf

Mathematical description

The pmf is a function \(f\) over the sample space \(\mathscr{S}\) of a discrete random variable \(X\), which gives the probability that \(X\) is equal to a certain value: \[f(x) = P(X = x)\]

Each pmf satisfies these conditions: \[ \begin{aligned} f(x) &\geq 0 ,\, \forall x \in \mathscr{S}\\ \sum_{x \in \mathscr{S}} \, f(x) &= 1 \end{aligned} \]

For a collection of outcomes \(\mathscr{A} \subseteq \mathscr{S}\): \[P(X \in \mathscr{A}) = \sum_{t \in \mathscr{A}} f(t)\]
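The three pmf conditions can be checked mechanically. The sketch below assumes a fair six-sided die as the sample space (an example that is not in the slides); `is_valid_pmf` and `prob` are hypothetical helper names.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, every face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def is_valid_pmf(f):
    """Conditions 1 and 2: no negative probabilities, and they sum to 1."""
    return all(p >= 0 for p in f.values()) and sum(f.values()) == 1

def prob(f, subset):
    """Condition 3: P(X in A) is the sum of f(t) over the outcomes t in A."""
    return sum(f[t] for t in subset)

even = {2, 4, 6}
print(is_valid_pmf(pmf))  # True
print(prob(pmf, even))    # 1/2
```

Exact fractions are used so that the sum-to-1 check is not disturbed by floating point rounding.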

Continuous distributions: PDF

For continuous distributions, we use a probability density function (pdf).

  1. \(f(x) \geq 0\) for all \(x\)
  2. The integral of the density over all possible values must be 1 (area under the curve)
  3. The probability that \(X\) takes a value in the interval \([a,b]\) is the integral of \(f\) from \(a\) to \(b\)

Continuous distributions: Conditions for a PDF

This might look like unnecessary mathematical detail. But it is actually important to understand the difference.

Example: can you answer the question “What is the probability your body temperature is 37.0 C?”

The answer might be unexpected: 0!

Let’s say your answer is 25%. But what if your temperature is 37.1? Does that count? Or 37.01?

Because the distribution is continuous, you can only say something about a range: “What is the probability your temperature is between 36.5 and 37.2 C?”
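A minimal sketch of the body temperature question. It assumes, purely for illustration, that temperature follows a normal distribution with mean 36.8 and standard deviation 0.4; these numbers are not from the slides. The probability of a range is the area under the pdf, and that area shrinks to 0 as the range around 37.0 narrows.

```python
from statistics import NormalDist

# Assumed for illustration only: temperature ~ Normal(mean=36.8, sd=0.4).
temperature = NormalDist(mu=36.8, sigma=0.4)

def prob_between(dist, a, b):
    """P(a <= X <= b) is the area under the pdf, i.e. cdf(b) - cdf(a)."""
    return dist.cdf(b) - dist.cdf(a)

p_range = prob_between(temperature, 36.5, 37.2)
print(f"P(36.5 <= T <= 37.2) = {p_range:.3f}")

# Shrinking the interval around 37.0 drives the probability towards 0.
narrow = []
for width in (0.1, 0.01, 0.001):
    p = prob_between(temperature, 37.0 - width / 2, 37.0 + width / 2)
    narrow.append(p)
    print(f"width {width}: {p:.6f}")
```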

Quiz time

quiz

Discrete Distributions

Bernoulli

We will look at three different discrete distributions.

The simplest case for discrete is a barplot with just two options.

We call this the “Bernoulli distribution”, named after Jacob Bernoulli (1655-1705). He also discovered the constant \(e\) and published work on infinite series.

Bernoulli

  • used to model binary events, like a coin toss
  • there is just one parameter \(p\)
  • Note how you can transform almost everything into a Bernoulli distribution!

The pmf is: \[ f(x) = \begin{cases} p & x \text{ is heads} \\ 1-p & x \text{ is tails} \end{cases} \]
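A small sketch of the Bernoulli pmf and a simulation of it. Coding heads as 1 and tails as 0 is an assumption made here, as are the value \(p = 0.3\) and the helper names.

```python
import random

def bernoulli_pmf(x, p):
    """Bernoulli pmf with heads coded as 1 and tails as 0 (assumed coding)."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def bernoulli_sample(p, n, seed=0):
    """Draw n simulated tosses; each is heads (1) with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

p = 0.3  # hypothetical biased coin
draws = bernoulli_sample(p, 10_000)
share_heads = sum(draws) / len(draws)
print(f"share of heads: {share_heads:.3f} (p = {p})")
```

With many draws the share of heads should settle close to \(p\).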

Uniform

  • Everything has an equal probability
  • If you have 10 options, every option has a 1/10 chance
  • this can be a mathematical way of saying: “I have no clue what will happen, so let’s make no assumptions”
  • there is also a continuous variation of this distribution

The pmf for \(n\) options is: \[ f(x) = \frac{1}{n}\]

Poisson

  • the probability of the number of times an event occurs in a fixed interval of time
  • the events must occur with a constant rate
  • All you need is an expected average, the parameter \(\lambda\) (lambda)

Question: think about the calls a call center receives. When can you use a Poisson distribution?

You will need a constant rate! While 9h-17h is a fixed interval of time, the rate is probably not constant.

But maybe 9h-12h is! You can use a separate Poisson distribution for each time frame!

Poisson

There is one parameter \(\lambda \in (0, \infty)\)

The pmf is: \[ f(k) = P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots\]
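A sketch of the standard Poisson pmf, \(\lambda^k e^{-\lambda} / k!\), with a check that the probabilities sum to 1 and that the mean equals \(\lambda\). The rate of 4 calls per interval is a made-up value.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = lam**k * exp(-lam) / k! for k = 0, 1, 2, ..."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4.0  # hypothetical: on average 4 calls per interval

# Terms beyond k = 60 are negligible for this rate, so we truncate there.
probs = [poisson_pmf(k, lam) for k in range(60)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))

print(f"sum of probabilities: {total:.6f}")
print(f"mean: {mean:.6f}")
```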

Quiz

quiz time

Continuous

Normal Distribution: Central Limit Theorem

This is one of the distributions that is used most often.

A major reason for this is that if you keep sampling from a population and adding up the samples, the sum tends towards a normal distribution (provided the samples are independent and have finite variance).

Normal Distribution: Central Limit Theorem

Take a person’s height.

  • This is determined by a combination of 180 genes.
  • One gene will contribute to a longer neck, the other to longer legs
  • If we assume the genes contribute independently, height equals the sum of the contributions of 180 genes.

Thus, height will be approximately normally distributed. So will the weight of wolves or the length of a penguin’s wing.
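The height argument can be simulated. This sketch assumes each of the 180 contributions is an independent uniform number between 0 and 1, which is a stand-in and not a biological model. If the sum is roughly normal, about 68% of the values should fall within one standard deviation of the mean and about 95% within two.

```python
import random
import statistics

rng = random.Random(42)

def simulated_height(n_genes=180):
    """Sum of independent contributions; uniform(0, 1) effects are assumed."""
    return sum(rng.random() for _ in range(n_genes))

heights = [simulated_height() for _ in range(5_000)]
mu = statistics.fmean(heights)
sigma = statistics.stdev(heights)

# Share of simulated heights within one and two standard deviations.
within_1sd = sum(abs(h - mu) <= sigma for h in heights) / len(heights)
within_2sd = sum(abs(h - mu) <= 2 * sigma for h in heights) / len(heights)

print(f"within 1 sd: {within_1sd:.3f}, within 2 sd: {within_2sd:.3f}")
```

Note that a single uniform contribution looks nothing like a bell curve; the bell shape only appears in the sum.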

Log-Normal Distribution

However, multiplying values will give you a long tail!

This is the case when variables interact in some way, and are not independent.

\[4 + 4 + 4 + 4 = 16\]

but

\[4 * 4 * 4 * 4 = 256\]

Common examples are stock prices, machine failures, ping times on a network and income distributions.

Log-Normal Distribution

Multiplying values will give you a fat-tailed distribution! This will typically be a log-normal distribution:

if \(X\) is log-normally distributed, then \(Y = \log(X)\) will be normally distributed.
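A sketch of effects that multiply instead of add. The 50 factors drawn uniformly between 0.8 and 1.25 are an arbitrary choice for illustration. The product has a long right tail (mean well above the median), while its logarithm is roughly symmetric.

```python
import math
import random
import statistics

rng = random.Random(7)

def product_of_factors(n_factors=50):
    """Multiply random positive factors; uniform(0.8, 1.25) is an assumed choice."""
    value = 1.0
    for _ in range(n_factors):
        value *= rng.uniform(0.8, 1.25)
    return value

products = [product_of_factors() for _ in range(5_000)]
logs = [math.log(x) for x in products]

# Long right tail: the mean of X sits clearly above its median.
mean_x, median_x = statistics.fmean(products), statistics.median(products)
# log(X) is roughly symmetric: mean and median nearly coincide.
mean_log, median_log = statistics.fmean(logs), statistics.median(logs)

print(f"X:      mean {mean_x:.2f}, median {median_x:.2f}")
print(f"log(X): mean {mean_log:.2f}, median {median_log:.2f}")
```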

Normal Distribution

The normal distribution has two parameters:

  • mean
  • standard deviation

The basic shape of the pdf (not yet normalised) is \[f(x) = e^{-x^2}\]

Normal Distribution

  • To get unit variance (\(\sigma^2 = 1\)) we add the factor \(\frac{1}{2}\) to the exponent, which gives \(e^{-\frac{1}{2}x^2}\).
  • To make the area under the curve equal to 1 we then multiply by \(\frac{1}{\sqrt{2\pi}}\).

The full pdf is:

\[f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2 \pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]
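The full pdf can be written out directly and checked numerically. In this sketch `riemann_area` is a hypothetical helper that approximates the area under a curve; it suggests that the full pdf integrates to 1, while the unscaled shape \(e^{-x^2}\) integrates to \(\sqrt{\pi}\).

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal pdf: exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * sqrt(2 * pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def riemann_area(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

area_full = riemann_area(normal_pdf, -10, 10)                    # close to 1
area_basic = riemann_area(lambda x: math.exp(-x * x), -10, 10)   # close to sqrt(pi)

print(f"area under full pdf:    {area_full:.6f}")
print(f"area under exp(-x**2):  {area_basic:.6f}")
```

The hand-written function can also be compared with `statistics.NormalDist(mu, sigma).pdf(x)` from the standard library.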

Exponential

The exponential distribution is connected to the Poisson distribution.

You can use it to model the time between independent events in a Poisson process.

Exponential

The assumption of a constant rate is rarely satisfied, but for our models it is often a good enough approximation.

The pdf is: \[f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0\]
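A sketch of the exponential pdf together with simulated waiting times. The rate of 2 events per unit of time is a made-up value. The cumulative probability \(1 - e^{-\lambda x}\) is added here for the check and does not appear on the slides.

```python
import math
import random
import statistics

def exponential_pdf(x, lam):
    """Density lam * exp(-lam * x) for x >= 0, and 0 for negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam):
    """P(X <= x) = 1 - exp(-lam * x) for x >= 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

lam = 2.0  # hypothetical rate: 2 events per unit of time
rng = random.Random(1)
waits = [rng.expovariate(lam) for _ in range(20_000)]

mean_wait = statistics.fmean(waits)                       # close to 1 / lam
share_short = sum(w <= 0.5 for w in waits) / len(waits)   # close to cdf(0.5)

print(f"mean wait: {mean_wait:.3f} (1/lam = {1 / lam})")
print(f"share <= 0.5: {share_short:.3f} (cdf = {exponential_cdf(0.5, lam):.3f})")
```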

Beta

The beta distribution describes a probability distribution over probabilities.

It is typically used to describe the evolution of an informed belief.

Beta

Let’s say you want to determine if a coin is fair. The probability should be somewhere between 0.0 and 1.0, that’s all you can say at first.

The range must always be between 0.0 and 1.0!

This is represented by \(Beta(1,1)\) (a continuous uniform distribution!)

If you start counting heads and tails, your belief will change over time

Beta

The evolution could look like this:

  • With 1 head, 2 tails you are not very certain of anything. The only thing you know for sure is that it won’t always throw heads.
  • With 2 heads, 2 tails your best guess would be that it is probably fair (\(p=0.5\)). But \(p=0.3\) or \(p=0.7\) would still be a real possibility.
  • With 2 heads, 8 tails, you will start believing that the coin is probably biased to throw heads just 20% of the time. But there is still a range of uncertainty.

Beta

If you keep tossing, you will get more certain.

There are two parameters, \(\alpha\) and \(\beta\). The easiest way to interpret them is as the number of coin flips that came up heads (\(\alpha\)) versus tails (\(\beta\)), counted on top of the \(Beta(1,1)\) starting point.

The pdf is a bit complex because it involves another function \(\Gamma\) :

\[f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}\]
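A sketch of the standard Beta density, \(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}\), applied to the coin example. It assumes the usual update rule of starting at \(Beta(1,1)\) and adding the observed heads to \(\alpha\) and tails to \(\beta\); `posterior_params` is a hypothetical helper name.

```python
import math

def beta_pdf(x, alpha, beta):
    """Beta density on 0 < x < 1, normalised with the gamma function."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * x ** (alpha - 1) * (1 - x) ** (beta - 1)

def posterior_params(heads, tails, prior=(1, 1)):
    """Start from Beta(1, 1) and add the observed counts (assumed update rule)."""
    return prior[0] + heads, prior[1] + tails

a, b = posterior_params(heads=2, tails=8)   # gives Beta(3, 9)

grid = [i / 1000 for i in range(1, 1000)]
most_likely_p = max(grid, key=lambda x: beta_pdf(x, a, b))
area = sum(beta_pdf(x, a, b) for x in grid) / 1000   # simple Riemann sum

print(f"Beta({a}, {b}) peaks at p = {most_likely_p}")
print(f"area under the curve: {area:.4f}")
```

With 2 heads and 8 tails the density peaks at \(p = 0.2\), and `beta_pdf(x, 1, 1)` returns 1 for every \(x\), which is the continuous uniform starting point.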

Quiz

quiz time

Statistical Modelling

Motivation

People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making.1

Motivation

But existing analytics often fails to provide this kind of story. Instead, numbers seemingly appear out of thin air.1

Motivation

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and scientific persuasion.1

Epistemic vs Aleatoric

  • We consider \(X\) as a random variable not because there is no single true value of \(X\), but because we don’t know what the true value is.
  • Expressing \(X\) as a random variable is our way to express our lack of knowledge about the true value of \(X\)

Epistemic vs Aleatoric

  • Uncertainty that is due to lack of knowledge is called epistemic uncertainty
  • There is also uncertainty that you will never overcome, such as the exact outcome of your next coin toss. This type of randomness/variability is called aleatoric uncertainty and is represented by the noise parameter \(\epsilon\)

It is important to distinguish between the two! There are limits to what you can do with your models!
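A rough sketch of the difference, assuming a fair coin. Uncertainty about the parameter \(p\) (epistemic) shrinks as we collect more flips. The spread of a single toss (aleatoric) stays the same no matter how much data we have. The function names are hypothetical.

```python
import math
import random
import statistics

rng = random.Random(3)
true_p = 0.5  # assumed: a fair coin

def estimate_p(n_flips):
    """Estimate p as the share of heads in n_flips simulated tosses."""
    heads = sum(rng.random() < true_p for _ in range(n_flips))
    return heads / n_flips

def spread_of_estimate(n_flips, repeats=200):
    """Standard deviation of the estimate over repeated experiments."""
    return statistics.stdev([estimate_p(n_flips) for _ in range(repeats)])

epistemic_small = spread_of_estimate(10)      # little data: wide spread
epistemic_large = spread_of_estimate(1_000)   # more data: narrow spread

# The spread of a single toss does not depend on how much data we collected.
aleatoric = math.sqrt(true_p * (1 - true_p))

print(f"spread of estimate, 10 flips:   {epistemic_small:.3f}")
print(f"spread of estimate, 1000 flips: {epistemic_large:.3f}")
print(f"spread of a single toss:        {aleatoric:.3f}")
```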

Models

A model has parameters.

Let’s say we use a simple linear model:

\[f(x)= ax + b\]

We could use data to estimate the optimal values for \(a\) and \(b\). Let’s say we find \(a=1.0\) and \(b=5.0\).

But how certain are we about our output?

Could it be \(b=5.1\)? Or even \(b=5.5\)? Or \(b=6.0\)?

Models

We can describe our parameters as distributions.

This way we could even model events with small datasets!

E.g.

\[a \sim \mathscr{N}(\mu, \sigma)\]

Meaning: the parameter \(a\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

Models

We can now express our uncertainty by saying: The value of \(a\) has a mean \(\mu=1.0\) and standard deviation \(\sigma=0.2\).
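A minimal sketch of what this means for the model output. It draws many values of \(a\) from a normal distribution with mean 1.0 and standard deviation 0.2, keeps \(b = 5.0\) fixed (a simplification made here) and evaluates \(f(x) = ax + b\) at the arbitrary input \(x = 10\).

```python
import random
import statistics

rng = random.Random(0)

b = 5.0    # kept fixed in this sketch
x = 10.0   # arbitrary input value chosen for illustration

# Draw the parameter a from Normal(mean=1.0, sd=0.2).
a_samples = [rng.gauss(1.0, 0.2) for _ in range(10_000)]

# Every sampled a gives a different prediction for the same x.
y_samples = [a * x + b for a in a_samples]

y_mean = statistics.fmean(y_samples)   # close to 1.0 * 10 + 5 = 15
y_sd = statistics.stdev(y_samples)     # close to 0.2 * 10 = 2

print(f"prediction at x = {x}: {y_mean:.2f} +/- {y_sd:.2f}")
```

The output is now a distribution of predictions instead of a single number, so the uncertainty about \(a\) stays visible in the result.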

Caveats: Simpson’s paradox

The shaded area is the 99% confidence interval.

Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)

Caveats: Simpson’s paradox

Now I tell you that the \(x\) axis is the number of hours invested in study, and the \(y\) axis is the average grade of a student.

Raise your hand if you think it is reasonable to say that increasing variable \(x\) will lower variable \(y\)
